feat: SM120 (Blackwell Desktop) support for DeepSeek-V4 inference by AliceChenyy · Pull Request #24692 · sgl-project/sglang

AliceChenyy · 2026-05-08T11:23:50Z

Summary

Adds full SM120 (RTX PRO 6000 / RTX 5090 / DGX Spark, compute 12.0) support for DeepSeek-V4/V3 on SGLang. SM120 desktop Blackwell GPUs lack TMEM, tcgen05, and DeepGEMM support — this PR provides Triton-based fallback kernels for all critical paths and enables CUDA graph capture.

Key changes

New kernels (7 files):

mxfp4_moe_sm120_triton.py — Triton fused MXFP4 dequant + GEMM for MoE experts (4.1x vs PyTorch per-GEMM)
flash_mla_sm120_triton.py — Triton FlashMLA sparse decode kernel (3.2–5.4x vs FlashInfer fallback)
sm120_mqa_triton.py — FP8 paged MQA with wq-precompute + vectorized batch (CUDA graph compatible)
flash_mla_sm120_fallback.py / sm120_mqa_fallback.py / mxfp4_moe_fallback.py — PyTorch fallback paths
test_sm120_mqa_fallback.py — Unit tests for MQA fallback correctness

SM120 guards (10 modified files):

DeepGEMM / tilelang MHC disabled on SM120 (no TMEM/tcgen05)
NSA backend: tilelang default, skip DeepGEMM metadata allocation
FlashMLA: SM120 adapter in deepseek_v4_backend.py
MoE: auto-select marlin backend on SM120
3 CUDA-graph-breaking paths fixed (.unique()/.item()/.nonzero() → vectorized)

Bug fix (found during latest-image validation):

FlashMLA Triton kernel: handle KV cache uint8 dtype (upstream changed from float8_e4m3fn)

Results (8× RTX PRO 6000, TP=8, CUDA graph)

On sglang:dev-cu13 (sgl-kernel 0.4.2.post1, PyTorch 2.11+cu130):

Metric	Value
GSM8K 5-shot (200q)	99.0% accuracy
Decode (BS=1)	11.40 tok/s (TPOT = 87.7ms)
CUDA graph capture	✅ all batch sizes captured

On older nightly-dev-20260430 (sgl-kernel 0.4.1, PyTorch 2.9.1+cu129):

Metric	Value
GSM8K 5-shot (200q)	98.0% accuracy
Decode (BS=1)	10.26 tok/s (TPOT = 97.5ms)
CUDA graph speedup	2.4× vs without (4.36 → 10.26 tok/s)
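
The figures in the two tables are mutually consistent, which can be checked with a few lines of arithmetic. This is only a sanity check on the reported numbers, assuming single-stream decode so that throughput is the reciprocal of TPOT; the helper name `tok_per_s` is made up for illustration.

```python
def tok_per_s(tpot_ms: float) -> float:
    """Single-stream decode throughput implied by a time-per-output-token in ms."""
    return 1000.0 / tpot_ms


new_tps = tok_per_s(87.7)  # sglang:dev-cu13 image
old_tps = tok_per_s(97.5)  # nightly-dev-20260430 image

# Reported throughput with vs. without CUDA graph on the older image.
graph_speedup = 10.26 / 4.36

# Relative throughput gain of the newer image over the older one.
image_gain = new_tps / old_tps - 1.0

print(f"{new_tps:.2f} tok/s vs {old_tps:.2f} tok/s")
print(f"CUDA graph speedup: {graph_speedup:.2f}x")
print(f"dev-cu13 vs older image: {image_gain:+.1%}")
```

The "11% faster" figure in the test plan corresponds to the throughput ratio; measured as a TPOT reduction, the same change is about 10%.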

Motivation

SM120 is desktop Blackwell (RTX 5090, RTX PRO 6000) — no server-class features (TMEM, tcgen05, NVSwitch)
Prior to this PR, SGLang cannot run DSv4 on SM120 at all (DeepGEMM JIT crash, no MXFP4 MoE support)
Enables developer/researcher access to DSv4 on workstation GPUs

Notes

This is a rebase of feat: SM120 support for DeepSeek-V4 inference #24047 (which targeted deepseek_v4 branch) onto main
All SM120 kernel code is identical between the two PRs (verified by diff)
SM120 kernels are guarded by is_sm120_supported() — zero impact on SM100/SM103 paths
Environment variables: SGLANG_SM120_TRITON_FLASHMLA=1 (default on), SGLANG_SM120_MQA_FALLBACK=0 (default off)

Test plan

GSM8K 5-shot 200q: 99.0% on latest sglang:dev-cu13 image
GSM8K 5-shot 200q: 98.0% on older nightly-dev-20260430 image
Single request sanity check (15+27=42, capital of France, Chinese translation)
TPOT benchmark: 87.7ms BS=1 on dev-cu13 (11% faster than old image)
CUDA graph capture: all batch sizes captured successfully
Unit tests: test_sm120_mqa_fallback.py
PDL enabled on SM120: verified correct (GSM8K 10/10)
CI (no SM120 runner in CI — tested on local 8× RTX PRO 6000)

CI States

Latest PR Test (Base): ❌ Run #26232048205
Latest PR Test (Extra): ❌ Run #26232047397

gemini-code-assist

Code Review

This pull request introduces comprehensive support and fallbacks for NVIDIA SM120 (Blackwell Desktop) architectures, which lack certain hardware features like TMEM/tcgen05 required by DeepGEMM and FlashMLA. It implements both pure-PyTorch and optimized Triton fallback kernels for FlashMLA, MQA logits, and MXFP4 MoE operations. Additionally, it updates environment configurations, metadata handling, and server arguments to automatically detect SM120 and route to these compatible implementations. Review feedback highlighted misleading function aliasing in the MoE implementation, the replacement of descriptive assertion messages with "TODO" strings, and inconsistent backend selection logic in the new entrypoints.

gemini-code-assist · 2026-05-08T11:27:01Z

+            from sglang.srt.layers.moe.fused_moe_triton.mxfp4_moe_sm120_triton import (
+                mxfp4_moe_forward_triton as mxfp4_moe_forward_fallback,
+            )


The alias mxfp4_moe_forward_triton as mxfp4_moe_forward_fallback is highly misleading. It imports the optimized Triton implementation but names it as if it were the fallback. This is especially confusing given that a separate PyTorch fallback implementation exists in mxfp4_moe_fallback.py.

This can cause significant confusion during future maintenance. Please rename the alias to accurately reflect the imported function, for example, by simply importing mxfp4_moe_forward_triton and using that name directly.

Suggested change

-            from sglang.srt.layers.moe.fused_moe_triton.mxfp4_moe_sm120_triton import (
-                mxfp4_moe_forward_triton as mxfp4_moe_forward_fallback,
-            )
+            from sglang.srt.layers.moe.fused_moe_triton.mxfp4_moe_sm120_triton import (
+                mxfp4_moe_forward_triton,
+            )

Please address this comment

Done — now imports mxfp4_moe_forward_triton directly without aliasing.

gemini-code-assist · 2026-05-08T11:27:01Z

+    assert head_dim == 128, "TODO"
+    assert block_size == 64, "TODO"


The assert messages have been changed from descriptive explanations to simply "TODO". The previous implementation also had these asserts, but the original messages explained why these values were hardcoded (e.g., "torch reference impl hardcodes DSV4 indexer head_dim=128"). Please consider restoring the more descriptive messages to improve code maintainability.

Suggested change

-    assert head_dim == 128, "TODO"
-    assert block_size == 64, "TODO"
+    assert head_dim == 128, "Vectorized torch impl hardcodes DSV4 indexer head_dim=128"
+    assert block_size == 64, "Vectorized torch impl hardcodes block_size=64 cache layout"

Done — restored descriptive assert messages.

gemini-code-assist · 2026-05-08T11:27:01Z

+_use_triton_flashmla = os.environ.get("SGLANG_SM120_TRITON_FLASHMLA", "1") == "1"
+
+
+def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):


In flash_mla_with_kvcache_entrypoint, the backend selection logic for SM120 is confusing. The function takes a backend parameter (from SGLANG_HACK_FLASHMLA_BACKEND), but this parameter is ignored when _is_sm120 is true. Instead, the choice between the Triton kernel and the PyTorch fallback is controlled by a separate environment variable, SGLANG_SM120_TRITON_FLASHMLA.

To improve clarity, it would be better to unify this control flow. Consider using the backend parameter to also control the implementation path on SM120, for instance, by checking for values like "triton" or "torch".

Done — the SGLANG_HACK_FLASHMLA_BACKEND env var has been removed. On SM120 the backend is now hardcoded to "kernel" (Triton path), with SGLANG_SM120_TRITON_FLASHMLA env var as the only escape hatch to fall back to PyTorch attention if needed for debugging. This simplifies the control flow as the gemini-bot suggested.

Rename it to flash_mla_with_kvcache_sm120. Also, please move this function to a standalone file. It shouldn't be put in the fallback file

Created flash_mla_sm120.py as the standalone file with the function renamed to flash_mla_with_kvcache_sm120. Deleted the old flash_mla_sm120_fallback.py.
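
The escape-hatch behaviour described in this thread can be sketched as follows. The environment variable name and the `== "1"` comparison come from the diff above; the function names and the returned labels are hypothetical and are not the code in flash_mla_sm120.py.

```python
import os


def use_triton_flashmla(environ=os.environ) -> bool:
    """Default on: only an explicit value other than "1" disables the Triton path."""
    return environ.get("SGLANG_SM120_TRITON_FLASHMLA", "1") == "1"


def pick_sm120_flashmla_path(environ=os.environ) -> str:
    # Hypothetical labels for the two implementations discussed in the thread.
    return "triton" if use_triton_flashmla(environ) else "torch_fallback"


print(pick_sm120_flashmla_path({}))
print(pick_sm120_flashmla_path({"SGLANG_SM120_TRITON_FLASHMLA": "0"}))
```

Note that with an exact string comparison, any value other than "1" (including "true") selects the PyTorch fallback.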

samuellees · 2026-05-10T07:26:02Z

Does this need an _is_sm120() check here?

I think is_cuda covers SM120

I mean and not _is_sm120

Done — added and not _is_sm120 guards at all DeepGEMM/tilelang paths in nsa_backend.py.

samuellees · 2026-05-10T07:52:42Z

+    # ── Graph-safe routing: flatten topk assignments ──
+    # token_ids[slot] = which row of A (original token index)
+    # expert_ids[slot] = which expert's weights to use
+    flat_expert_ids = topk_ids.reshape(-1).contiguous()  # [M*topk]


topk_ids can be -1 for padded/filtered tokens, but this path passes them directly as Triton expert ids. Could we use safe_expert_ids = topk_ids.clamp_min(0) for loads and set the invalid slots' output to zero? The PyTorch fallback already skips eid_val < 0.

Handled — see lines 361-363: flat_expert_ids_raw.clamp(min=0) for safe indexing, plus line 441+: invalid slots are zeroed out after the kernel. The Triton kernel itself runs on clamped-to-0 expert IDs (safe for loads), and the output is masked to zero post-kernel.
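
The clamp-then-mask pattern discussed here can be illustrated in plain Python. This is a simplified sketch, not the Triton kernel: each "expert" is reduced to a single scalar weight, and `route_slots`, `expert_scale`, and `token_values` are hypothetical names.

```python
def route_slots(topk_ids, expert_scale, token_values):
    """Apply a per-expert scale to each routed slot; slots with id -1 are padding.

    topk_ids:     flat expert id per slot, -1 marks a padded/filtered slot
    expert_scale: one scalar per expert, standing in for the expert weights
    token_values: one input value per slot
    """
    # Step 1: clamp so every slot indexes a valid expert (no out-of-bounds load).
    safe_ids = [max(eid, 0) for eid in topk_ids]

    # Step 2: the "kernel" runs unconditionally on every slot, with fixed shapes.
    out = [expert_scale[eid] * v for eid, v in zip(safe_ids, token_values)]

    # Step 3: zero the slots that were invalid before clamping.
    return [o if eid >= 0 else 0.0 for o, eid in zip(out, topk_ids)]


result = route_slots([1, -1, 0], [2.0, 3.0], [1.0, 1.0, 1.0])
print(result)
```

Because every slot is processed and invalid ones are masked afterwards, the amount of work does not depend on how many ids are -1, which is what keeps the path compatible with CUDA graph capture.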

samuellees · 2026-05-10T07:53:27Z

For SM120, should this CP ragged path explicitly raise NotImplementedError instead of falling through to deep_gemm.fp8_mqa_logits, since DeepGEMM is unsupported here?

Good catch. Will add if _is_sm120: raise NotImplementedError("CP ragged indexer not supported on SM120")

b8zhong · 2026-05-10T11:55:33Z

        return False
-    return get_jit_cuda_arch().major >= 9
+    arch = get_jit_cuda_arch()
+    # PDL requires SM100+ datacenter (tcgen05/TMEM); SM120 (desktop Blackwell) lacks these


Currently, there is a little bug on old CUTLASS version if the kernel is CUTLASS. But can you elaborate on PDL not working?

PDL is now re-enabled on SM120.

Address all reviewer feedback from PR sgl-project#24692:
- Use is_sm120_supported() helper instead of raw sm_version checks
- Guard SGLANG_OPT_DEEPGEMM_HC_PRENORM and SGLANG_OPT_USE_TILELANG_MHC_PRE with `not is_sm120_supported()` in deepseek_v4.py
- Auto-select marlin MoE backend on SM120 in deepseek_v4_hook.py
- Minor cleanups in indexer, metadata, nsa_backend, mxfp4_marlin_moe

Fix FlashMLA Triton kernel garbled output on latest sglang:dev image:
- Root cause: upstream changed KV cache dtype from float8_e4m3fn to uint8. The Triton kernel's as_strided() preserved the input dtype, so tl.load interpreted FP8 bit patterns as raw integers, corrupting attention scores.
- Fix: explicitly view through uint8 → float8_e4m3fn before passing to Triton.

Verified on sglang:dev-cu13 (sgl-kernel 0.4.2.post1, PyTorch 2.11+cu130):
- GSM8K 5-shot 200q: 99.0%
- Decode BS=1: 11.40 tok/s, TPOT 87.7ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
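
To see why reading FP8 bytes as integers corrupts the scores, the following pure-Python sketch decodes a few float8_e4m3fn bit patterns by hand and compares them with the raw byte values. It is an illustration of the number format only (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7), not the PR's Triton code; `decode_e4m3fn` is a hypothetical helper.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn bit pattern stored in a uint8."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # e4m3fn has no infinities; this pattern is NaN
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


raw_bytes = [0x38, 0x40, 0xB8, 0x7E]

# What a kernel computes with if it treats the storage as plain uint8.
as_integers = [float(b) for b in raw_bytes]

# What the cache actually encodes.
as_fp8 = [decode_e4m3fn(b) for b in raw_bytes]

for b, i, f in zip(raw_bytes, as_integers, as_fp8):
    print(f"0x{b:02X}: as integer {i}, as e4m3fn {f}")
```

In PyTorch, the reinterpretation the commit describes is a dtype view (for example `t.view(torch.float8_e4m3fn)` on a uint8 tensor), which changes how the same bytes are read without copying them.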

Adds full SM120 (RTX PRO 6000 / RTX 5090 / DGX Spark) support for DeepSeek-V4 on SGLang, rebased onto main branch.

Key changes:
- Triton MXFP4 MoE kernel for SM120 (no MARLIN/tcgen05 on desktop Blackwell)
- Triton FlashMLA sparse decode kernel for SM120
- MQA wq-precompute with vectorized batch for CUDA graph compatibility
- DeepGEMM/PDL guards for SM120 (no TMEM/tcgen05)
- NSA backend SM120 dispatch (tilelang default, skip DeepGEMM metadata)
- FlashMLA SM120 adapter for deepseek_v4_backend
- 3 CUDA-graph-breaking paths fixed (MoE .unique/.item, NSA/Compressed MQA)

Results (8x RTX PRO 6000, TP=8):
- Decode: 10.26 tok/s BS=1 with CUDA graph (2.4x vs without)
- GSM8K 5-shot: 98.0% accuracy (200 questions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Skip DeepGEMM transform_sf_into_required_layout (tcgen05 unsupported) and topk_v2 (128KB SMEM exceeds SM120 99KB limit) on SM120. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rted() directly Address b8zhong review: do not define SM120 detection in this file, call the existing util is_sm120_supported() at each usage site instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

is_sm120_supported() already checks is_cuda() internally and is lru_cached, so the redundant `_is_cuda and` prefix is unnecessary. - metadata.py: remove _is_cuda/_is_sm120 module vars, call util directly - flash_mla_sm120_fallback.py: remove _is_cuda, simplify _is_sm120 - nsa_backend.py: remove redundant is_cuda() prefix - nsa_indexer.py: remove redundant _is_cuda prefix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…orted() The previous commit removed _is_sm120 from metadata.py but indexer.py still imported it, causing DeepseekV4ForCausalLM to fail to register and cascading CI failures across all DSv4 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…quant) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolve conflicts from NSA→DSA rename (sgl-project#25821): - Migrate SM120 guards from nsa_indexer.py → dsa_indexer.py - Migrate SM120 guards from nsa_backend.py → dsa_backend.py - Move sm120_mqa_fallback.py, sm120_mqa_triton.py to dsa/ directory - Update test import path to dsa.sm120_mqa_fallback - Use is_sm120_supported() directly (no module-level _is_sm120 var)

…ctor The NSA→DSA rename (PR sgl-project#25821) changed all backend attributes from nsa_prefill_backend/nsa_decode_backend to dsa_prefill_backend/dsa_decode_backend, but the SM120 branches in _set_default_dsa_backends() still used the old names, silently creating orphan attributes while leaving dsa_* as None. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
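
The failure mode in this commit message (assigning to a stale attribute name silently creates a new attribute) is ordinary Python behaviour and is easy to reproduce. The classes below are hypothetical stand-ins, not SGLang's ServerArgs; the `__slots__` variant is shown only as one possible way to make such a typo fail loudly, not as what the PR does.

```python
class Args:
    """Plain class: assigning to an unknown name just adds a new attribute."""

    def __init__(self):
        self.dsa_prefill_backend = None
        self.dsa_decode_backend = None


args = Args()
args.nsa_decode_backend = "tilelang"  # stale pre-rename name, no error raised
print(args.dsa_decode_backend)  # the attribute that is actually read stays None


class StrictArgs:
    """With __slots__, unknown attribute names are rejected at assignment time."""

    __slots__ = ("dsa_prefill_backend", "dsa_decode_backend")

    def __init__(self):
        self.dsa_prefill_backend = None
        self.dsa_decode_backend = None


strict = StrictArgs()
try:
    strict.nsa_decode_backend = "tilelang"
    caught = False
except AttributeError:
    caught = True
print(caught)
```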

Add CI registration wrapper so test_sm120_mqa_fallback runs on every PR. The tests are pure PyTorch and work on any CUDA GPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

samuellees · 2026-05-20T11:32:22Z

/tag-and-rerun-ci ++

samuellees · 2026-05-21T07:42:35Z

Hi @b8zhong @Fridge003, it seems the CI failures are not related to this PR. Do you have any comments on that, please?
https://github.com/sgl-project/sglang/actions/runs/26156058484/job/77088750160?pr=24692#step:13:279
https://github.com/sgl-project/sglang/actions/runs/26156058484/job/77088750202?pr=24692

Fridge003 · 2026-05-21T08:05:26Z

Update cookbook for sm120 device usage: https://github.com/sgl-project/sglang/blob/main/docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx

Fridge003 · 2026-05-21T06:05:52Z

    max_seq_len: int,
    clean_logits: bool = True,
 ) -> torch.Tensor:
+    """CUDA-graph-compatible FP8 paged MQA logits (vectorized, no .item()).


Don't modify this function inplace. Please open a new function for fp8_paged_mqa for sm120 and put your implementation there

Done. Restored the original fp8_paged_mqa_logits_torch and created a separate fp8_paged_mqa_logits_torch_sm120 with the vectorized (no .item()) implementation. The dispatch in C4IndexerBackendMixin selects the SM120 variant when is_sm120_supported().

Fridge003 · 2026-05-21T06:14:24Z

+    # SM120: auto-select marlin MoE backend (dispatches to SM120 Triton kernel)
+    from sglang.srt.utils.common import is_sm120_supported
+
+    if is_sm120_supported() and server_args.moe_runner_backend == "auto":


It has already been set in server_args.py

Removed. The SM120 marlin auto-select is now only in server_args.py.

Fridge003 · 2026-05-21T06:14:51Z

            self.dsa_decode_backend = "tilelang"
        elif kv_cache_dtype == "fp8_e4m3":
-            if major >= 10:
+            if is_sm120_supported():


dsa_prefill_backend and dsa_decode_backend are for GLM-5 and DeepSeek-V3.2. Remove their settings.

Removed both SM120 branches from _set_default_dsa_backends().

Fridge003 · 2026-05-21T06:15:13Z

@@ -0,0 +1,281 @@
+"""


Move these tests to main function of sm120_mqa_fallback.py and sm120_mqa_triton.py. Don't register them as CI tests

Removed the test file and CI registration. The source files (sm120_mqa_fallback.py, sm120_mqa_triton.py) under dsa/ were also removed per comment #6 (dead code for V4), so there is no target to move tests into.

Fridge003 · 2026-05-21T06:47:25Z

            return y, post, comb, False

-        if envs.SGLANG_OPT_USE_TILELANG_MHC_PRE.get():
+        if envs.SGLANG_OPT_USE_TILELANG_MHC_PRE.get() and not is_sm120_supported():


Instead of adding is_sm120_supported() to these if conditions, can we automatically set these environment variables to false when we detect SM120?

Done. SM120 env auto-set is now in server_args.py under the DeepseekV4ForCausalLM block:

envs.SGLANG_OPT_FP8_WO_A_GEMM.set(False)
envs.SGLANG_OPT_USE_TOPK_V2.set(False)
envs.SGLANG_OPT_USE_TILELANG_MHC_PRE.set(False)
envs.SGLANG_OPT_DEEPGEMM_HC_PRENORM.set(False)
envs.SGLANG_FP8_PAGED_MQA_LOGITS_TORCH.set(True)

Removed all is_sm120_supported() guards from deepseek_v4.py. Also changed post_load_weights to use envs.SGLANG_OPT_FP8_WO_A_GEMM.get() instead of the module-level cached _FP8_WO_A_GEMM, since the env is set after module import.
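
The reason the module-level cache had to go can be shown with a small self-contained sketch. `EnvFlag` is a hypothetical stand-in for an env entry with get()/set(), the default of True is an assumption made for the demonstration, and the function names are illustrative rather than the actual deepseek_v4.py code.

```python
class EnvFlag:
    """Hypothetical stand-in for an env entry exposing get() and set()."""

    def __init__(self, default: bool):
        self._value = default

    def get(self) -> bool:
        return self._value

    def set(self, value: bool) -> None:
        self._value = value


# Assumed default, for illustration only.
FP8_WO_A_GEMM = EnvFlag(default=True)

# Snapshot taken at import time, before server-arg handling has run.
_CACHED_AT_IMPORT = FP8_WO_A_GEMM.get()


def apply_sm120_defaults() -> None:
    """Runs later, once SM120 has been detected."""
    FP8_WO_A_GEMM.set(False)


def post_load_weights_cached() -> bool:
    return _CACHED_AT_IMPORT  # stale: still the import-time value


def post_load_weights_live() -> bool:
    return FP8_WO_A_GEMM.get()  # reads the value at call time


apply_sm120_defaults()
print(post_load_weights_cached(), post_load_weights_live())
```

Reading the flag at call time costs one extra lookup but guarantees that the value set during server-arg processing is the one that takes effect.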

Fridge003 · 2026-05-21T07:01:02Z

 from sglang.srt.layers.dp_attention import get_attention_tp_size
 from sglang.srt.model_executor.forward_batch_info import ForwardBatch, ForwardMode
 from sglang.srt.utils import is_cuda, is_hip
+from sglang.srt.utils.common import is_sm120_supported


This file is unrelated to DeepSeek-V4. Don't modify it.

Reverted all changes.

Fridge003 · 2026-05-21T07:27:54Z

    if (_is_cuda and sm_version < 90) or (_is_musa and sm_version < 31):
        return False
+    # DeepGEMM requires TMEM/tcgen05 (SM100+datacenter), not available on SM120
+    if is_sm120_supported():


Use sm_version == 120 here instead

Changed to if sm_version == 120: return False.
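
A minimal sketch of the resulting check, assuming the function simply returns True once the guards pass (the real function in configurer.py may test more conditions). The name `deep_gemm_supported` and its parameters are hypothetical; the two conditions are taken from the diff above.

```python
def deep_gemm_supported(sm_version: int, is_cuda: bool = True, is_musa: bool = False) -> bool:
    """Return whether DeepGEMM should be enabled for a given SM version."""
    # Existing lower bounds from the diff context.
    if (is_cuda and sm_version < 90) or (is_musa and sm_version < 31):
        return False
    # SM120 passes the ">= 90" bound numerically but lacks TMEM/tcgen05.
    if sm_version == 120:
        return False
    return True  # assumption: no further checks in this sketch


for sm in (89, 90, 100, 103, 120):
    print(sm, deep_gemm_supported(sm))
```

The explicit equality test matters because a purely numeric lower bound would treat SM120 as a superset of SM100, which the PR description says it is not.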

Fridge003 · 2026-05-21T08:00:32Z

-            )[0]
+            )
+
+            o = flash_mla_with_kvcache_entrypoint(**input_dict, backend="kernel")[0]


Don't modify in this way. Please only use the newly created entrypoint when it's sm120

Restructured: SM120 branch calls flash_mla_with_kvcache_sm120 from the new standalone file; the original flash_mla.flash_mla_with_kvcache(...) call is preserved unchanged in the else branch.

Fridge003 · 2026-05-21T08:06:59Z

@@ -0,0 +1,185 @@
+"""PyTorch fallback for MXFP4 MoE GEMM on SM120.


Remove it if it's not used anywhere

Confirmed no imports. Deleted.

Fridge003 · 2026-05-21T08:09:49Z

+_use_triton_flashmla = os.environ.get("SGLANG_SM120_TRITON_FLASHMLA", "1") == "1"
+
+
+def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):


Rename it to flash_mla_with_kvcache_sm120. Also, please move this function to a standalone file. It shouldn't be put in the fallback file

…move dead code
- Restore original fp8_paged_mqa_logits_torch, add separate SM120 variant
- Move flash_mla SM120 impl to standalone flash_mla_sm120.py
- SM120-only dispatch in deepseek_v4_backend.py, original path untouched
- Auto-set SM120-incompatible envs in server_args.py (FP8_WO_A_GEMM, TOPK_V2, TILELANG_MHC_PRE, DEEPGEMM_HC_PRENORM, FP8_PAGED_MQA_LOGITS_TORCH)
- Remove is_sm120_supported() guards from deepseek_v4.py and metadata.py
- Use sm_version == 120 in configurer.py
- Revert DSA files (dsa_indexer.py, dsa_backend.py): not used by V4
- Delete dead code: sm120_mqa_{fallback,triton}.py, mxfp4_moe_fallback.py, test_sm120_mqa_fallback.py, CI registration
- Remove redundant SM120 marlin block from deepseek_v4_hook.py
- Remove SM120 branches from _set_default_dsa_backends (GLM-5/V3.2 only)

Verified: GSM8K 10/10 on 8xRTX PRO 6000 (SM120), no manual env vars needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add SM120 device to hardware table and deployment notes: - RTX PRO 6000 (8 GPUs) for V4-Flash - Docker image: lmsysorg/sglang:dev-cu13 - Launch command with auto-detected SM120 defaults - Performance and accuracy notes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AliceChenyy · 2026-05-21T11:47:44Z

Done. Added SM120 (RTX PRO 6000) section to the DeepSeek-V4 cookbook — covers TP=8 (recommended) and TP=4 (memory-constrained) configurations, Docker image, and performance notes. See commit 21c00bc.

DarkSharpness · 2026-05-21T14:10:04Z

    if is_hip_runtime():
        return False
-    return get_jit_cuda_arch().major >= 9
+    arch = get_jit_cuda_arch()


Please revert this unnecessary change.

Done — reverted in cd44380.

…tils.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AliceChenyy requested review from 1am9trash, AniZpZ, BBuf, DarkSharpness, Edwardf0t1, FlamingoPg, Fridge003, HaiShaw, HydraQYH, Qiaolin-Yu, Ying1123, b8zhong, celve, ch-wan, hebiao064, hlu1, hubertlu-tw, ispobock, kkHuang-amd, merrymercy and yuan-luo as code owners May 8, 2026 11:23

github-actions Bot added deepseek jit-kernel labels May 8, 2026

gemini-code-assist Bot reviewed May 8, 2026

AliceChenyy force-pushed the sm120-dsv4-rebase branch 3 times, most recently from 8779574 to a4b24dd Compare May 9, 2026 04:20

samuellees mentioned this pull request May 10, 2026

Feature/sm120 deepseek v4 highspeed inference support #24303

Open

5 tasks

samuellees reviewed May 10, 2026

b8zhong reviewed May 10, 2026

AliceChenyy requested a review from YAMY1234 as a code owner May 20, 2026 02:53

AliceChenyy force-pushed the sm120-dsv4-rebase branch from 8c8b8aa to 53191ba Compare May 20, 2026 03:21

AliceChenyy and others added 8 commits May 19, 2026 20:55

style: fix pre-commit lint issues (isort, ruff, black)

aa7000c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: wrap pytest.main in sys.exit for CI exit code propagation

3e33539

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add SM120 guards for sgl-kernel 0.4.2.post2 compatibility

9db9a66

Skip DeepGEMM transform_sf_into_required_layout (tcgen05 unsupported) and topk_v2 (128KB SMEM exceeds SM120 99KB limit) on SM120. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AliceChenyy force-pushed the sm120-dsv4-rebase branch from 53191ba to 88a2c05 Compare May 20, 2026 04:00

AliceChenyy and others added 4 commits May 19, 2026 22:17

style: remove unused imports flagged by ruff (rotate_activation, act_…

a3d06b9

…quant) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test: register SM120 MQA fallback tests in CI (base-b, 1-gpu-small)

03be8ad

Add CI registration wrapper so test_sm120_mqa_fallback runs on every PR. The tests are pure PyTorch and work on any CUDA GPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fridge003 requested changes May 21, 2026

AliceChenyy and others added 2 commits May 21, 2026 04:13

AliceChenyy requested review from JustinTong0323, wisclmy0611 and zijiexia as code owners May 21, 2026 11:46

github-actions Bot added the documentation Improvements or additions to documentation label May 21, 2026

Merge remote-tracking branch 'origin/main' into sm120-dsv4-rebase

5f526de

DarkSharpness reviewed May 21, 2026

fix: revert unnecessary change to is_arch_support_pdl in jit_kernel/u…

cd44380

…tils.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
